Abstract. We investigate the fundamental statistical features of tagged (or 
annotated) networks having a rich variety of attributes associated with their nodes. 
Tags (attributes, annotations, properties, features, etc.) provide essential information 
about the entity represented by a given node, thus, taking them into account represents 
a significant step towards a more complete description of the structure of large complex 
systems. Our main goal here is to uncover the relations between the statistical 
properties of the node tags and those of the graph topology. In order to better 
characterise the networks with tagged nodes, we introduce a number of new notions, 
including tag-assortativity (relating link probability to node similarity), and new 
quantities, such as node uniqueness (measuring how rarely the tags of a node occur 
in the network) and tag-assortativity exponent. We apply our approach to three large 
networks representing very different domains of complex systems. A number of the 
tag-related quantities display analogous behaviour (e.g., the networks we studied are 
tag-assortative, indicating possible universal aspects of tags versus topology), while 
some other features, such as the distribution of the node uniqueness, show variability 
from network to network, allowing for pinpointing large-scale specific features of 
real-world complex networks. We also find that for each network the topology and the tag 
distribution are scale invariant, and this self-similar property of the networks can be 
well characterised by the tag-assortativity exponent, which is specific to each system. 
1. Introduction 

Many complex systems in nature and society can be successfully represented in terms 
of networks capturing the intricate web of connections among the units they are made 
of [1, 2]. In recent years, research in this field has focused mainly on 
the topology of the graphs corresponding to these real networks. Since this approach 
is rooted in, among others, statistical physics, where the thermodynamic limit is often 
considered, and since the size of the known networks is becoming huge, several 
large-scale properties of real-world webs have been uncovered, e.g., a low average distance 
combined with a high average clustering coefficient [3], the broad (scale-free) distribution 
of node degree (number of links of a node) [4, 5, 6, 7] and various signatures of 
hierarchical/modular organisation [8, 9]. 

On the other hand, there has been a quickly growing interest in the local structural 
units of networks. Small and well defined sub-graphs consisting of a few vertices have 
been introduced as motifs [10, 11], whereas somewhat larger units, associated with more 
highly interconnected parts [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26], are 
usually called communities, clusters, cohesive groups, or modules. These structural 
sub-units can correspond to multi-protein functional units in molecular biology [8, 27], 
a set of tightly coupled stocks or industrial sectors in economy [28, 29], groups of people 
[19, 30, 31], cooperative players [32, 33, 34], etc. The location of such building blocks 
can be crucial to the understanding of the structural and functional properties of the 
systems under investigation. 

The majority of the complex network studies concern "bare" graphs corresponding 
to a simple list of connections between the nodes, or at most weighted networks 
where a connection strength (or intensity) is associated with the links. However, the 
introduction of node tags (also called attributes, annotations, properties, categories, 
features) leads to a richer structure, opening up the possibility for a more comprehensive 
analysis of the systems under investigation. These tags can correspond to basically any 
information about the nodes and in most cases a single node can have several tags at 
the same time. The use of such annotations in biological networks is a common practice 
[35, 36, 37, 38, 39, 40], where the tags usually refer to the biological function of the 
units (proteins, genes, etc.). Another interesting application of node features can be 
seen in the studies of co-evolving network models, where the evolution of the network 
topology affects the node properties and vice versa [41, 42, 43, 44, 45, 46, 47, 48, 49, 50]. 
These models are aimed at describing the dynamics of social networks, in which people 
with similar opinions are assumed to form ties more easily, and the opinions of connected 
people become more similar in time. Finally, we mention the study of collaborative 
tagging in Ref. [51], where tripartite networks were constructed from data concerning 
users who associated tags with some kind of items (such as music listeners classifying 
music records). The three types of nodes corresponded to the users, the tags, and the 
items. The tagging was carried out without any central authority and, according to the 
results, the analysis of the bi- and unipartite projections of the networks can help in 
structuring the contents (e.g., in defining a hierarchy between the tags). 

In this paper we study tagged networks from yet another point of view. Our focus 
is on networks where the links are in principle not related to tagging, but tags 
can be associated with the nodes quite naturally. The PACS numbers or key-words 
in case of co-authorship networks, the scope of business or the industrial sector of 
companies in the context of financial networks, or the status of employees in the case of 
a network representing the social ties inside a large firm provide plausible examples for 
possible tags. The complexity of the networks studied these days is rapidly increasing 
together with their size. The use of tags associated with the nodes can help in revealing 
hidden structures or speed up searching within the networks. Since the usefulness of such 
attributes has already been proven in biology, the inclusion of tags in the analysis of 
other networks as well is expected to give a deeper understanding of the interrelations 
shaping the structure and dynamics of the systems under study. 

Along this line, in the present paper we study the fundamental statistics 
characterising the distribution of tags in large annotated real networks. By choosing 
networks representing completely unrelated systems (a co-authorship network, a protein 
interaction network, and the English Wikipedia), we look for signs of universality in 
these statistics. Furthermore, we are interested in the relations between the network 
topology and the distribution of tags. The tags enable the definition of a similarity 
function between the nodes which is a priori independent of the topology. We shall 
refer to this quantity as the tag-similarity of the nodes in order to distinguish it from 
the usual structural similarity of the nodes (based on the similarity between the nearest 
neighbours). The study of the tag-similarity opens up further directions for exploring the 
intricate relations between the annotations and the graph structure itself. Interestingly, 
in all selected systems, the tags form a sort of taxonomy: they correspond to features 
ranging from very specific to rather general ones, which are embedded in a hierarchic 
structure held together by "is a sub-category of" type relations. This inter-relatedness 
of the tags adds an extra twist to the definition of the quantities we study. 

The paper is organised as follows. In Sect. 2 we define the most important 
quantities we aim to study, whereas the construction of the investigated networks (and 
the hierarchy of the corresponding node labels) is detailed in Sect. 3. The results are 
presented in Sect. 4 and we close the paper with some concluding remarks in Sect. 5. 

2. Definitions 

2.1. Basic statistics 

2.1.1. Number of tags on a node In principle, nodes in a network can be tagged with 
almost anything. Here we list a few basic types followed by particular examples in 
parentheses: real numbers (the accumulated impact factors of authors in a co-authorship 
network), integers (the number of articles of an author), or character strings (functions 
of proteins in a protein-protein interaction network). However, in most cases (including 
the systems we study in the present paper), the node attributes correspond to character 
strings, chosen from a finite set of possible tags. Usually a node can have more than 
one tag attached to it, e.g., numerous proteins appearing in a protein-protein interaction 
(PPI) network have multiple functions. One of the basic statistics about the annotations 
is the distribution of the number of tags on the nodes. 

Figure 1. A small labelled sub-graph and the corresponding DAG of categories in the 
English Wikipedia. In the left panel we show a few neighbours of the page "Gladiator", 
where the connections correspond to mutual hyperlinks between the pages embedded 
in the text of the page. At the bottom of each page we can find a list of categories, 
which we use as tags. These are listed in the frames appearing near the nodes. These 
categories are organised into a DAG, as demonstrated in the right panel, where e.g., 
"Gladiator types" is a sub-category of "Gladiatorial combat". The categories appearing 
in both panels are emphasised in black. 

2.1.2. Tag frequencies Similarly to the varying number of tags on the nodes, the 
frequency of the different tags can also be rather heterogeneous. What makes the picture 
even more complex is that in many cases the tags refer to categories of a taxonomy or 
ontology (capturing the view of a certain domain, e.g., protein functions). This means 
that the tags are organised into a structure of relationships which can be represented 
by a directed acyclic graph (DAG), where the directed links between two categories 
represent an "is a sub-category of" relation. The nodes close to the root in the DAG are 
usually related to general properties, and as we follow the links towards the leaves, the 
categories become more and more specific. In some cases we can find categories in the 
DAG with more than one in-neighbour, meaning that the given sub-category is part 
of more than one category (that are not parts of one another). Also note that nodes 
can be classified not only by the leaf categories; e.g., several proteins in a PPI network 
can be found with rather general functional descriptions. We illustrate the concept of 
tagged networks and the corresponding DAG of categories in Fig. 1 with the help of the 
English Wikipedia. 

Given the DAG between the possible tags, we can define the frequency of a given 
tag $\alpha$ in two different ways: 

$$p_\alpha = N_\alpha / N, \tag{1a}$$
$$\tilde{p}_\alpha = \tilde{N}_\alpha / N, \tag{1b}$$

where $N_\alpha$ denotes the number of nodes tagged with $\alpha$, $\tilde{N}_\alpha$ stands for the number of 
nodes tagged with $\alpha$ or any of its descendants, and $N$ is equal to the total number 
of nodes in the network. From these definitions it follows that when the number of 
un-tagged nodes is zero, the root of the annotation DAG will receive $\tilde{p}_\alpha = 1$, whereas 
for the leaf categories $\tilde{p}_\alpha = p_\alpha$. Furthermore, if category $\beta$ is a descendant of $\alpha$, then 
$\tilde{p}_\alpha \geq \tilde{p}_\beta$. Low frequency tags are more specific in an information theoretical sense, 
whereas high frequency tags carry almost no information (e.g., being tagged by the root 
in the annotation DAG adds absolutely no information to the description of a node). 
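
As an illustration of the two frequency definitions, the following minimal Python sketch counts, for each category, the nodes tagged with it directly and the nodes tagged with it or with any of its descendants. The toy taxonomy ("tool", "knife", "kitchen knife", "hammer"), the four toy nodes and the function name `tag_frequencies` are assumptions made only for this example.

```python
from collections import defaultdict

def tag_frequencies(node_tags, parents):
    """Return (p, p_tilde): direct and descendant-inclusive tag frequencies.

    node_tags: dict node -> set of tags attached directly to the node
    parents:   dict tag -> set of parent tags in the annotation DAG
               (the root maps to an empty set)
    """
    def ancestors_or_self(tag):
        # The tag itself plus every category reachable through parent links.
        seen, stack = set(), [tag]
        while stack:
            t = stack.pop()
            if t not in seen:
                seen.add(t)
                stack.extend(parents.get(t, ()))
        return seen

    n_total = len(node_tags)
    direct, inclusive = defaultdict(int), defaultdict(int)
    for tags in node_tags.values():
        for t in tags:
            direct[t] += 1
        # A node is counted once for every category it belongs to,
        # either directly or through one of that category's descendants.
        covered = set()
        for t in tags:
            covered |= ancestors_or_self(t)
        for t in covered:
            inclusive[t] += 1
    all_tags = set(parents) | set(direct)
    p = {t: direct[t] / n_total for t in all_tags}
    p_tilde = {t: inclusive[t] / n_total for t in all_tags}
    return p, p_tilde

# Hypothetical toy taxonomy and annotation (not data from the paper).
parents = {"tool": set(), "knife": {"tool"},
           "kitchen knife": {"knife"}, "hammer": {"tool"}}
node_tags = {1: {"knife"}, 2: {"kitchen knife"},
             3: {"hammer"}, 4: {"kitchen knife", "hammer"}}
p, p_tilde = tag_frequencies(node_tags, parents)
```

In this toy case every node is tagged, so the root receives a descendant-inclusive frequency of one, and the two frequencies coincide for the leaf categories.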

In the following, we shall refer to the sub-graph induced by the nodes (i.e. 
constituted by these nodes and all links between them) marked by the tag $\alpha$ and any 
of its descendants as the tag-induced sub-graph of $\alpha$. The number of nodes in this 
sub-graph is given by $\tilde{N}_\alpha$, whereas the number of links can vary between $M_\alpha = 0$ 
and $M_\alpha = \tilde{N}_\alpha (\tilde{N}_\alpha - 1)/2$. It is interesting to compare $M_\alpha$ to the number of links 
$M_{\rm rand}$ one would expect in a random sub-graph of the same size: if $M_\alpha$ is significantly 
larger/smaller than $M_{\rm rand}$, then nodes sharing the tag $\alpha$ attract/repel each other in the 
sense that they are linked with higher/lower probability than at random. 
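
The comparison between the link count of a tag-induced sub-graph and its random expectation can be sketched as follows. This is a minimal illustration under the assumption that the random expectation is the overall linkage probability of the network times the number of node pairs in the sub-graph; the edge list, the member set and the function names are hypothetical.

```python
def induced_link_count(edges, members):
    """Count the links whose two endpoints both belong to `members`."""
    return sum(1 for u, v in edges if u in members and v in members)

def random_expectation(n_sub, n_total, m_total):
    """Expected number of links among n_sub randomly chosen nodes.

    The linkage probability is taken as the link density of the whole
    network, M / [N (N - 1) / 2].
    """
    p_link = m_total / (n_total * (n_total - 1) / 2)
    return p_link * n_sub * (n_sub - 1) / 2

# Hypothetical toy network with six nodes and six links.
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6)]
members = {1, 2, 3}   # nodes carrying the tag or one of its descendants
m_alpha = induced_link_count(edges, members)
m_rand = random_expectation(len(members), n_total=6, m_total=len(edges))
attract = m_alpha > m_rand   # True here: the tagged nodes form a triangle
```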



2.2. Tag-similarity 

Our aim in this section is to define a similarity function between the nodes which is 
based solely on the tags, therefore, it can be evaluated without any knowledge about 
the graph structure. Although we refer to this quantity as the tag-similarity of the 
nodes in general, we shall use the term similarity in the same sense for short. 



2.2.1. Simple similarity measures To what extent two nodes $i$ and $j$ having sets of tags 
$\Omega_i$ and $\Omega_j$ are similar is a question far from trivial, as the number of possible similarity 
measures is vast. A simple approach is to use the Jaccard-similarity [52] defined as 

$$s^{(J)}_{ij} = \frac{|\Omega_i \cap \Omega_j|}{|\Omega_i \cup \Omega_j|}, \tag{2}$$

where $|\Omega_i \cap \Omega_j|$ is equal to the number of common tags and $|\Omega_i \cup \Omega_j|$ is equal to the 
total number of different tags in $\Omega_i$ and $\Omega_j$. Another possibility is to represent the 
annotations as vectors $\mathbf{v}_i$ and $\mathbf{v}_j$, where the number of entries in the vectors is equal 
to the number of different tags in the network, and the non-zero elements indicate the 
presence (or possibly the weight) of the actual tags on the given node. In this approach 
the cosine similarity 

$$s^{(c)}_{ij} = \frac{\mathbf{v}_i \cdot \mathbf{v}_j}{|\mathbf{v}_i| \, |\mathbf{v}_j|} \tag{3}$$

yields a simple similarity value for a pair of nodes $i$ and $j$. 



Fundamental statistical features and self-similar properties of tagged networks 






The advantage of the above methods is that they do not depend on the DAG 
between the tags, therefore, they can be applied even when the tags are not part of 
a structured taxonomy. However, when a tag refers to a sub-category of another tag, 
the similarity measure should be refined. As an example, let us assume that node $i$ 
is tagged exclusively with category $\alpha$, and node $j$ has a single tag $\beta$ that is a direct 
descendant of $\alpha$ (e.g., $\alpha$ = "knife" and $\beta$ = "kitchen knife"). In this case both (2) and 
(3) yield $s^{(J)}_{ij} = s^{(c)}_{ij} = 0$, which is not what we would expect. 
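
The shortcoming described above can be reproduced in a few lines. The sketch below implements the Jaccard and cosine similarities for plain tag sets, assuming binary (unweighted) tag vectors; the function names and the extra tags "steel" and "wood" are illustrative assumptions.

```python
import math

def jaccard(tags_i, tags_j):
    """Number of common tags over number of distinct tags (0.0 if both sets are empty)."""
    union = tags_i | tags_j
    return len(tags_i & tags_j) / len(union) if union else 0.0

def cosine(tags_i, tags_j):
    """Cosine similarity of the binary indicator vectors of two tag sets."""
    if not tags_i or not tags_j:
        return 0.0
    # For 0/1 vectors the dot product equals the overlap size and each
    # norm equals the square root of the set size.
    return len(tags_i & tags_j) / math.sqrt(len(tags_i) * len(tags_j))

# A parent category and its direct sub-category share no literal tag,
# so both measures report zero similarity.
s_jaccard = jaccard({"knife"}, {"kitchen knife"})
s_cosine = cosine({"knife"}, {"kitchen knife"})

# Two nodes sharing one of their two tags (hypothetical tags).
s_shared_j = jaccard({"knife", "steel"}, {"knife", "wood"})
s_shared_c = cosine({"knife", "steel"}, {"knife", "wood"})
```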

2.2.2. Semantic similarities To overcome the problem raised above, we should use a 
similarity measure which takes into account the structure of the annotation DAG. At 
this point we divide the evaluation of similarity into two parts: first we deal with the 
similarity $s_{\alpha\beta}$ between a pair of tags, then elaborate on how to combine the pairwise 
similarities $s_{\alpha\beta}$, $\alpha \in \Omega_i$, $\beta \in \Omega_j$ between the sets of tags $\Omega_i$, $\Omega_j$ associated with a pair of 
nodes $i$ and $j$ to obtain $s_{ij}$. 

A simple choice for determining the similarity between two tags is the length of 
the longest shared path towards the root of the annotation DAG. A somewhat more 
sophisticated approach is to use semantic similarities. The basic idea behind these 
methods is to take into account the frequency of the tags: sharing a rare tag by two 
nodes should indicate high similarity, whereas sharing a frequent tag should not. The 
semantic similarity between tags $\alpha$ and $\beta$ was derived by Resnik [53] as 

$$s^{(R)}_{\alpha\beta} = \max_{\gamma \in \Gamma(\alpha,\beta)} [-\log \tilde{p}_\gamma], \tag{4}$$

where $\Gamma(\alpha,\beta)$ denotes the set of common ancestors of $\alpha$ and $\beta$, and $-\log \tilde{p}_\gamma$ corresponds 
to the information content of category $\gamma$. From this definition it follows that if $\beta$ is 
a descendant of $\alpha$, then $s^{(R)}_{\alpha\beta} = -\log \tilde{p}_\alpha$, and when the two compared tags are not 
connected by a directed path, then $s^{(R)}_{\alpha\beta}$ is equal to the information content of one of 
their nearest common ancestors. A closely related similarity measure was proposed by 
Lin [54] as 

$$s^{(L)}_{\alpha\beta} = \frac{2 \max_{\gamma \in \Gamma(\alpha,\beta)} [-\log \tilde{p}_\gamma]}{|\log \tilde{p}_\alpha + \log \tilde{p}_\beta|}. \tag{5}$$

In practice (5) was reported to slightly underperform [55]; however, the big advantage 
of (5) is that $s^{(L)}_{\alpha\beta}$ becomes bounded in [0, 1]. The maximal possible $s^{(R)}_{\alpha\beta}$ obtained from 
(4) depends on the frequency of the rarest tag, which in our case is strongly varying 
from system to system. For this reason, we shall use (5) for calculating the similarity 
between categories. 
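
A minimal Python sketch of the Resnik and Lin similarities on a toy annotation DAG is given here for illustration. It assumes that a category counts as its own ancestor (so that a tag and its descendant have the parent tag as a common ancestor, as required by the property stated above) and that the information content is computed from the descendant-inclusive frequencies. The toy taxonomy, the frequency values and the function names are hypothetical, and the convention chosen for the undefined 0/0 case is arbitrary.

```python
import math

def ancestors_or_self(tag, parents):
    """The tag itself plus every category reachable through parent links."""
    seen, stack = set(), [tag]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, ()))
    return seen

def resnik(a, b, parents, p_tilde):
    """Largest information content, -log(p_tilde), among the common ancestors."""
    common = ancestors_or_self(a, parents) & ancestors_or_self(b, parents)
    return max(-math.log(p_tilde[g]) for g in common)

def lin(a, b, parents, p_tilde):
    """Resnik similarity rescaled by the information content of both tags."""
    denom = abs(math.log(p_tilde[a]) + math.log(p_tilde[b]))
    # If both tags have frequency one the ratio is 0/0; returning 0.0 is
    # an arbitrary convention of this sketch.
    return 2 * resnik(a, b, parents, p_tilde) / denom if denom > 0 else 0.0

# Hypothetical taxonomy with descendant-inclusive frequencies.
parents = {"tool": set(), "knife": {"tool"},
           "kitchen knife": {"knife"}, "hammer": {"tool"}}
p_tilde = {"tool": 1.0, "knife": 0.75, "kitchen knife": 0.5, "hammer": 0.5}

s_parent_child = resnik("knife", "kitchen knife", parents, p_tilde)
s_unrelated = resnik("knife", "hammer", parents, p_tilde)
l_parent_child = lin("knife", "kitchen knife", parents, p_tilde)
l_self = lin("kitchen knife", "kitchen knife", parents, p_tilde)
```

In contrast to the Jaccard and cosine measures, the pair "knife" and "kitchen knife" now receives a non-zero similarity, whereas two categories whose only common ancestor is the root receive zero.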

When moving from the similarity of tags to the similarity of nodes, again we have 
a number of possibilities to choose from. A simple approach is to use the average of the 
pairwise similarities as 

$$s_{ij} = \frac{1}{n_i n_j} \sum_{\alpha \in \Omega_i} \sum_{\beta \in \Omega_j} s^{(L)}_{\alpha\beta}, \tag{6}$$

where $n_i$ and $n_j$ denote the number of tags on node $i$ and $j$ respectively. The problem 
with the expression above is that if the labels associated with a given node are very 
different from each other, then by comparing this node even to itself, the "cross-terms" 
reduce the similarity value. A simple solution is to replace the average in (6) by the 
maximal pairwise similarity amongst the tags: 

$$s_{ij} = \max_{\alpha \in \Omega_i, \, \beta \in \Omega_j} s^{(L)}_{\alpha\beta}. \tag{7}$$

Another possibility along this line is to organise the pairwise similarities between the 
tags into an $n_i$ by $n_j$ matrix, and define the quantities rowScore and columnScore as the 
average of the maximal values in the rows and columns of this matrix respectively. The 
similarity between the two annotation vectors can then be given as either the average 
or the maximum of rowScore and columnScore [56]. In our studies we shall use (7) 
due to its computational simplicity and the fact that it is analogous to the concept of 
minimum linkage clustering, where the distance between two sets of elements (the tags) 
is defined as the minimum pairwise distance between the elements. 
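
The effect of the "cross-terms" can be made concrete with a short sketch that compares a node carrying two unrelated tags with itself, using the three ways of combining pairwise tag similarities discussed above. The pairwise similarity `toy_tag_sim` (one for identical tags, zero otherwise) and the tags "opera" and "soccer" are stand-ins chosen for illustration; in the paper the pairwise values come from the semantic similarity between categories.

```python
def average_similarity(tags_i, tags_j, tag_sim):
    """Average of all pairwise tag similarities."""
    pairs = [(a, b) for a in tags_i for b in tags_j]
    return sum(tag_sim(a, b) for a, b in pairs) / len(pairs)

def max_similarity(tags_i, tags_j, tag_sim):
    """Maximal pairwise similarity amongst the tags, as in (7)."""
    return max(tag_sim(a, b) for a in tags_i for b in tags_j)

def row_column_scores(tags_i, tags_j, tag_sim):
    """rowScore and columnScore: means of the row-wise and column-wise maxima."""
    row = sum(max(tag_sim(a, b) for b in tags_j) for a in tags_i) / len(tags_i)
    col = sum(max(tag_sim(a, b) for a in tags_i) for b in tags_j) / len(tags_j)
    return row, col

def toy_tag_sim(a, b):
    """Hypothetical pairwise similarity: 1 for identical tags, 0 otherwise."""
    return 1.0 if a == b else 0.0

tags = ["opera", "soccer"]   # one node carrying two unrelated tags
s_avg = average_similarity(tags, tags, toy_tag_sim)   # reduced by cross-terms
s_max = max_similarity(tags, tags, toy_tag_sim)
scores = row_column_scores(tags, tags, toy_tag_sim)
```

With the average, the self-similarity of this node is only one half, whereas the maximum and the row/column scores both give one.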



2.3. Tag-assortativity 

A plausible hypothesis about tagged real networks is that links are likely to form 
between similar nodes and, vice versa, that connected nodes share common tags 
with enhanced probability. However, this property is not evident in all cases. E.g., if we 
colour the nodes in a network according to the famous vertex colouring problem [57] 
(namely, we seek the minimal number of colours which can be distributed in such 
a way that no neighbours have the same colour), and identify the node colours as the 
tags, then similar nodes are actually never connected. 

In general, the property that nodes are more frequently connected to others that 
are similar/different in some quality is referred to as assortativity/disassortativity. The 
most typically considered quality - which is based on the network's topology - is the 
degree of the nodes. In tagged networks, however, another natural way of comparing 
nodes can be based on the above defined tag-similarity. We can thus introduce the 
notion of tag-assortativity (to distinguish this property from the degree-assortativity), 
and call a network tag-assortative/tag-disassortative if nodes having similar tags are 
linked with higher/lower probability than at random. 



2.4. Uniqueness 

Interestingly it is not uncommon to find tags associated with the same node which are 
rather different from each other, e.g., in the PPI network studied in this paper more 
than 10% of the nodes have at least one pair of tags for which the nearest common 
ancestor is actually the root of the annotation DAG. This means that the given protein 
can take part in very different biological processes. On the other hand, many nodes 
have more or less similar categories in their annotation, so they take part in more or 
less similar processes. 
To quantify the above aspect, we introduce the node uniqueness, defined as 

$$u_i = \min_{\alpha, \beta \in \Omega_i} s^{(R)}_{\alpha\beta}. \tag{8}$$

In principle, we could have chosen $s^{(L)}_{\alpha\beta}$ rather than $s^{(R)}_{\alpha\beta}$ in the definition above. However, 
since $s^{(L)}_{\alpha\alpha} = 1$ for every $\alpha$, if node $i$ has only a single tag, then $u_i$ would be unity 
independent of whether this tag is frequent or not. By using the Resnik-similarity 
(4), for which $s^{(R)}_{\alpha\alpha} = -\log \tilde{p}_\alpha$, we can differentiate between nodes with single tags as 
well, based on the tag frequencies. The lowest possible value for $u$ occurs in the case 
where a node belongs to more than one category, out of which at least two have the 
root of the DAG as their nearest common ancestor. The highest possible value for $u$ 
occurs if a node belongs to a single category, and this category happens to be the rarest 
among all. We note that in Ref. [51] a closely related quantity called node diversity was 
defined for the case where the tags are not part of a hierarchical taxonomy. 
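
The two limiting cases described above can be checked with a small sketch. It assumes that the uniqueness is the minimum of the Resnik similarity over all pairs of tags on the node, with each tag also paired with itself, and that a category counts as its own ancestor. The toy taxonomy, the frequency values and the function names are hypothetical.

```python
import math
from itertools import product

def ancestors_or_self(tag, parents):
    """The tag itself plus every category reachable through parent links."""
    seen, stack = set(), [tag]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, ()))
    return seen

def resnik(a, b, parents, p_tilde):
    """Largest information content among the common ancestors of a and b."""
    common = ancestors_or_self(a, parents) & ancestors_or_self(b, parents)
    return max(-math.log(p_tilde[g]) for g in common)

def uniqueness(tags, parents, p_tilde):
    """Smallest Resnik similarity over all pairs of the node's tags."""
    return min(resnik(a, b, parents, p_tilde)
               for a, b in product(tags, repeat=2))

# Hypothetical taxonomy with descendant-inclusive frequencies.
parents = {"tool": set(), "knife": {"tool"},
           "kitchen knife": {"knife"}, "hammer": {"tool"}}
p_tilde = {"tool": 1.0, "knife": 0.75, "kitchen knife": 0.5, "hammer": 0.5}

# A single specific tag: the uniqueness equals its information content.
u_single = uniqueness({"kitchen knife"}, parents, p_tilde)
# Two tags whose nearest common ancestor is the root: the lowest value.
u_mixed = uniqueness({"kitchen knife", "hammer"}, parents, p_tilde)
```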



3. Applications 

We studied the node annotations in three networks of high importance from the 
aspect of practical applications, capturing the relations between interacting proteins, 
collaborating scientists, and pages of an on-line encyclopedia. The PPI network of 
MIPS [58] contained N = 4546 proteins, connected by M = 12319 links, and the tags 
attached to the nodes corresponded to 2067 categories describing the biological processes 
the proteins take part in. The DAG between these categories was obtained from the 
Gene Ontology database [59]. 

The investigated co-authorship network is known as the MathSciNet (Mathematical 
review collection of the American Mathematical Society) [60], and represents the 
M = 873775 links of collaboration between N = 391529 mathematicians. The node 
tags were obtained from the 6499 different subject classes of the articles, which were 
organised into a DAG. Thus, the set of tags attached to each author was the union of 
all subject-classes that appeared on her/his papers. 

Finally, the nodes in the third studied network corresponded to the N = 1473894 
pages of the English Wikipedia [61, 62, 63, 64], connected by the M = 3755485 
hyperlinks embedded in the text of the pages. At the bottom of each page, one can 
find a list of categories, which were used as node tags. Since each wiki-category is a 
page in the Wikipedia as well, we removed these pages from the network to keep a clear 
distinction between nodes and attributes. Furthermore, we kept only the mutual links 
between the remaining pages. Similarly to the biological processes in the MIPS network 
or the subject classes in the MathSciNet, the wiki-categories can have sub-categories and 
are usually part of a larger wiki-category. However, when representing these relations as 
a directed graph, some directed loops appear, therefore, they do not form a strict DAG 
as required for, e.g., the semantic similarity measures (4) and (5). In order to be able to use 
these similarity functions, we removed a few relations from this graph until it turned 
into a DAG, following a method detailed in the Appendix. 
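
The method of the Appendix is not reproduced here. Purely as a generic illustration of how a directed category graph with loops can be turned into a DAG, the sketch below drops every link that points back to a category on the current depth-first search path. This is a standard technique and an assumption of this example, not necessarily the procedure used in the paper; the three-category loop and the function name are hypothetical.

```python
def remove_back_edges(children):
    """Return (dag, removed): a copy of the graph without DFS back edges.

    children: dict category -> list of sub-categories; every category
              must appear as a key (possibly with an empty list).
    """
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on current path, finished
    colour = {v: WHITE for v in children}
    dag = {v: [] for v in children}
    removed = []

    def visit(v):
        colour[v] = GREY
        for w in children[v]:
            if colour[w] == GREY:
                # w is on the current search path, so v -> w closes a loop.
                removed.append((v, w))
                continue
            dag[v].append(w)
            if colour[w] == WHITE:
                visit(w)
        colour[v] = BLACK

    for v in children:
        if colour[v] == WHITE:
            visit(v)
    return dag, removed

# Hypothetical category graph containing one directed loop A -> B -> C -> A.
children = {"A": ["B"], "B": ["C"], "C": ["A"]}
dag, removed = remove_back_edges(children)
```

The recursive form is meant for small examples only; a graph as large as the Wikipedia category system would call for an iterative traversal.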
Figure 2. The density distributions of the tag frequencies a) $p_\alpha$ and b) $\tilde{p}_\alpha$ on 
logarithmic scale. 



Due to the very large size of this network, some of the analysis we carried out turned 
out to be very time consuming, therefore, in certain cases we used only smaller 
sub-graphs of the Wikipedia, induced by rather general categories, e.g., "Soccer", "Japan", 
etc. (The tags which were not descendants of the chosen category were naturally dropped 
from the nodes in the tag-induced sub-graph.) The advantage of this method is that the 
categories appearing as node tags in the resulting sub-graph also form a DAG which is 
equivalent to the DAG of the descendants of $\alpha$ (in which the root is $\alpha$). In this paper 
we show the results for the case where $\alpha$ = "Japan" (altogether N = 43307 nodes, 
M = 102753 links and 3197 sub-categories); however, other choices gave very 
similar results as well. 

We also checked whether this sort of sampling from the networks distorts the studied 
statistics by examining tag-induced sub-graphs in the other two networks (and smaller 
tag-induced sub-graphs in the Wikipedia/Japan network) as well. We found that for all 
statistics studied in this paper the results in a large enough tag-induced sub-graph are 
very similar to those for the whole network, and the differences can be mostly attributed 
to the different system sizes. 

4. Results 

4.1. Basic statistics 

We begin our investigations in Fig. 2 with the distribution of the tag frequencies in 
the three networks. According to Fig. 2a, the distribution of $p_\alpha$ resembles a 
power-law for the MIPS network and the Wikipedia, whereas it resembles an exponential 
distribution for the MathSciNet. When moving from $p_\alpha$ to $\tilde{p}_\alpha$ (by including the nodes 
tagged with any descendants of $\alpha$ as well), the tail of the distribution becomes 
power-law like for each network, as shown in Fig. 2b. This is consistent with the hierarchical 
nature of the annotation DAG: categories high up in the DAG correspond to general 
concepts, therefore apply to a vast number of nodes, whereas leaf categories (without 
any descendants) refer to something specific, therefore occur rarely [65, 66, 67]. 
Figure 3. a) The density distributions of the number of tags $n$ per node. b) The 
average number of tags $\langle n \rangle$ as a function of the node degree $k$. 

Our main goal in this paper is to study the relations between the distribution of 
node tags and the network topology. One of the most basic statistical quantities which can 
be studied in this respect is the number of tags $n_i$ for each node $i$. In Fig. 3a we display 
the density distribution of $n_i$ in the studied systems, whereas Fig. 3b shows the average 
number of tags, $\langle n \rangle$, as a function of the node degree. Since the range of the possible 
$n$ values is rather wide (especially in case of the MathSciNet), we used exponentially 
increasing bin sizes in Fig. 3a. The decay of the distributions towards large $n$ values 
seems exponential. Concerning the curves shown in Fig. 3b, a plausible hypothesis about 
tagged real networks is that they show tag-assortativity, namely links form ties between 
similar nodes more frequently than at random. Therefore, we expect connected nodes to 
share common tags with enhanced probability. Consequently hubs are expected to have 
a larger number of tags than nodes with small degrees, since they have to share common 
attributes with a large number of other nodes. Interestingly, in Fig. 3b the MathSciNet 
behaves as expected from this point of view (with a monotonically increasing $\langle n \rangle (k)$ 
curve), whereas the MIPS network and the Wikipedia do not. For both networks, 
$\langle n \rangle (k)$ is increasing at small degrees, then in case of the MIPS network it saturates, 
whereas for the Wikipedia it even drops down at high degrees. This implies that the 
simple picture shown above, in which the hubs correspond to versatile nodes with a 
large number of different tags, does not hold in these systems. 
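
The two statistics discussed above can be sketched in Python as follows: a density estimate of the number of tags per node using exponentially increasing bin sizes, and the average number of tags as a function of the node degree. The bin base of 2, the toy data and the function names are assumptions made for illustration; the binning parameters used for the figures are not stated in the text.

```python
def log_binned_density(values, base=2):
    """Density estimate with bins [base**k, base**(k+1)), for integers >= 1.

    Each count is divided by the sample size and by the bin width, so that
    the density integrates to one over the occupied bins.
    """
    counts = {}
    for v in values:
        k, upper = 0, base
        while v >= upper:   # integer arithmetic avoids rounding errors
            k, upper = k + 1, upper * base
        counts[k] = counts.get(k, 0) + 1
    total = len(values)
    return [(base ** k, base ** (k + 1),
             c / (total * (base ** (k + 1) - base ** k)))
            for k, c in sorted(counts.items())]

def mean_tags_by_degree(degree, n_tags):
    """Average number of tags over all nodes having the same degree."""
    sums, counts = {}, {}
    for node, k in degree.items():
        sums[k] = sums.get(k, 0) + n_tags[node]
        counts[k] = counts.get(k, 0) + 1
    return {k: sums[k] / counts[k] for k in sorted(sums)}

# Hypothetical toy data.
density = log_binned_density([1, 1, 2, 3, 4, 5, 6, 7])
mean_n = mean_tags_by_degree({1: 1, 2: 1, 3: 2}, {1: 2, 2: 4, 3: 5})
```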

4.2. Tag-induced sub-graphs 

Due to the size of the entire Wikipedia, we have been able to analyse only some of 
its tag-induced sub-graphs, as described in Sect. 3. To get a better understanding of 
the relationship between tag distribution and network topology, it is very insightful 
to go further down this line, and compare some of the basic properties of the 
tag-induced sub-graphs for every category (all the way from the root to the leaves) in all 
of our three networks. The scatter plots in Fig. 4 with gray symbols depict the link 
number ($M$) vs. node number ($N$) relation for each category. $M$ has a maximum of 
$M_{\max}(N) = N(N-1)/2$, when the sub-graph forms a clique, i.e., each node is linked to 
all the others. This upper bound is shown with a dashed-dotted line. The estimate of 
the number of links $M_{\rm rand}(N) = pN(N-1)/2 = pM_{\max}(N)$ between $N$ randomly selected 
nodes is also plotted with a dashed line, where the linkage probability is defined as 
$p = M/[N(N-1)/2]$, with $M$ and $N$ taken for the entire network. According to the scatter 
plots, in all the three systems the number 
of links $M$ in every tag-induced sub-graph (with some exception at $M = 0$) exceeds the 
number of links $M_{\rm rand}$ expected for a link distribution that is uncorrelated to the tag 
distribution. This strongly indicates that the networks under study are tag-assortative. 

Figure 4. The scatter-plot of the number of links, $M$ versus the number of nodes, $N$ 
in the tag-induced sub-graphs of the different categories (gray symbols) for the MIPS 
network (a), the Wiki-Japan network (b) and the MathSciNet (c). The black symbols show 
$\langle M \rangle$, whereas the solid lines correspond to the best power-law fit to $\langle M \rangle (N)$. 
The dot-dashed lines and the dashed lines in each plot show the upper bound in $M$ 
and the expected number of links for randomly chosen nodes respectively. 

An even more intriguing property of the scatter plots is that if the average number 
of links $\langle M \rangle$ is plotted (with black symbols) as a function of the number of nodes 
$N$ (using logarithmic binning), then it strictly follows a power law $\langle M \rangle \sim N^{\mu}$ 
(solid lines) for several orders of magnitude (with a deviation only at the smallest 
sub-graphs). The tag-assortativity exponent $\mu$, defined by this power law, takes the 
values of 1.30 ± 0.02, 1.16 ± 0.02, and 1.18 ± 0.01 for the MIPS, the Wiki-Japan, and 
the MathSciNet networks, respectively. The physical meaning of this exponent can be 
demonstrated by considering the relation between the tag-induced sub-graph of some 
category and those of its sub-categories. If the tag-induced (not necessarily disjoint) 
sub-graphs of the sub-categories inherit all the links of the parent category "homogeneously" 
and without having inter-sub-graph links (i.e., having no links between any pair of 
sub-graphs other than those originating in the intersection), then the number of links 
corresponding to a sub-category is expected to scale linearly with the number of its 
nodes, implying $\mu = 1$. If, however, inter-sub-graph links also appear (cf. Fig. 5), then 
the number of links corresponding to a sub-category is expected to drop faster than 
linearly, leading to $\mu > 1$. Although $\mu < 1$ cannot be ruled out (at least locally, between 
a particular category and its sub-categories), it requires very peculiar topologies (e.g. 
large link density in the intersection between the tag-induced sub-graphs of two 
sub-categories) and, thus, we do not expect to obtain such values for real systems. 

In brief, a value of $\mu > 2$ indicates tag-disassortativity; $\mu = 2$ characterises no 
correlation between tag-similarity and link distribution (cf. $M_{\rm rand}$); whereas $0 < \mu < 2$ 
is the regime of tag-assortativity with the amendment that $0 < \mu < 1$ would represent 
extreme tag-assortativity. This classification scheme affirms that the tag-assortativity 
exponent $\mu$ defined above is indeed an appropriate quantity for characterising the extent 
of tag-assortativity. Our finding that its value for the three networks we have studied is 
closer to 1 than to 2 suggests that these networks exhibit a significant tag-assortativity, 
MIPS being somewhat less tag-assortative than the other two. 
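
The exponent can be estimated as the slope of a straight line fitted to the averaged data on a double logarithmic scale. The sketch below uses an ordinary least-squares fit, which is an assumption of this example since the fitting procedure is not specified in the text, and applies it to synthetic data: one series generated with an exponent of 1.2 and one following the random expectation, for which the slope approaches 2. The prefactors, sizes and the function name are hypothetical.

```python
import math

def fit_exponent(sizes, mean_links):
    """Least-squares slope of log<M> against log N, i.e. mu in <M> ~ N**mu."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(m) for m in mean_links]
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

sizes = [10, 100, 1000, 10000]

# Synthetic tag-assortative data generated with exponent 1.2.
assortative = [3 * n ** 1.2 for n in sizes]
mu_assortative = fit_exponent(sizes, assortative)

# Random expectation p N (N - 1) / 2 with a hypothetical link density p.
uncorrelated = [0.001 * n * (n - 1) / 2 for n in sizes]
mu_uncorrelated = fit_exponent(sizes, uncorrelated)
```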

Both the fact that the statistical properties of tag-induced sub-graphs are similar 
to those of the entire graph and also the fact that a single well defined exponent 
characterises tag-assortativity over several orders of magnitude of the sub-graph size 
imply prominent self-similarity in the structure of tagged networks. Briefly speaking, 
the tag-induced sub-graph $A$ of some category $\alpha$ is related to the tag-induced sub-graphs 
$B_i \subset A$ of its sub-categories $\beta_i \subset \alpha$ statistically the same way as the sub-graphs $B_i$ of 
categories $\beta_i$ to the tag-induced sub-graphs $C_{ij} \subset B_i$ of their sub-categories $\gamma_{ij} \subset \beta_i$, as 
demonstrated in Fig. 5, i.e. both the network topology and the tag distribution appear 
to be scale invariant. 
Figure 5. Demonstration of the self-similar nature of recursively embedded 
tag-induced sub-graphs $C_{ij} \subset B_i \subset A$ generated by the DAG of hierarchically organised 
categories $\gamma_{ij} \subset \beta_i \subset \alpha$. The grey level is indicative of the link density. 

4.3. Similarity 

The introduction of a similarity measure based on the node tags enables us to study other
types of relations between the topology and the annotations as well. In Fig. 6 we follow
the change of the similarity between the nodes with the distance in the three networks.
The right column of the figure shows the density distribution $p(s_{ij})$ for $s_{ij}$ obtained
from (7), whereas the left column displays the corresponding average similarity,
$\langle s_{ij} \rangle$, as a function of the node distance $d$. The $p(s_{ij})$ distributions are
shifted towards lower values with increasing distance $d$ between the nodes and, accordingly,
a rapidly decreasing tendency can be observed in the $\langle s_{ij} \rangle(d)$ function at small
distances. At medium node distances $\langle s_{ij} \rangle$ becomes more or less constant,
suggesting that the nodes become independent of each other. Consistent with the results of
Sect. 4.2, this is another indication of tag-assortativity: if links were drawn between the
nodes at random, $\langle s_{ij} \rangle$ would be independent of the distance between the nodes
(the $\langle s_{ij} \rangle(d)$ curve would resemble a flat line). The prominent peak at distance
$d = 1$ signals that neighbouring nodes are much more similar to each other than nodes at
larger distances, and also much more similar than expected at random.
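
A minimal sketch of how the average similarity can be collected as a function of the node distance is given below. The similarity measure used in this paper is defined earlier in the text and is not reproduced here; the Jaccard index of the tag sets serves purely as a stand-in (an assumption), and the names `jaccard`, `bfs_distances` and `mean_similarity_by_distance`, as well as the toy graph, are hypothetical.

```python
from collections import deque, defaultdict

def jaccard(tags_a, tags_b):
    """Stand-in similarity (assumption): overlap of two tag sets."""
    union = tags_a | tags_b
    return len(tags_a & tags_b) / len(union) if union else 0.0

def bfs_distances(adj, source):
    """Shortest-path distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

def mean_similarity_by_distance(adj, tags, similarity=jaccard):
    """Return {d: (average similarity, number of pairs)} over unordered pairs."""
    total = defaultdict(float)
    count = defaultdict(int)
    for u in adj:
        for v, d in bfs_distances(adj, u).items():
            if u < v:  # count each unordered pair once (node ids must be comparable)
                total[d] += similarity(tags[u], tags[v])
                count[d] += 1
    return {d: (total[d] / count[d], count[d]) for d in sorted(count)}

# Toy path graph 0-1-2-3 with made-up tags, for illustration only
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
tags = {0: {"a", "b"}, 1: {"a", "b", "c"}, 2: {"b", "c"}, 3: {"c", "d"}}
result = mean_similarity_by_distance(adj, tags)
for d, (avg, pairs) in result.items():
    print(d, round(avg, 3), pairs)
```

Returning the number of pairs together with the average makes it possible to flag the distances at which the statistics become too poor to be trusted.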

At large node distances the number of pairs is rapidly decreasing (i.e., at the maximum
possible distance only a few pairs of nodes can contribute to $\langle s_{ij} \rangle$). To
indicate that the number of samples in this regime is not enough for a significant
statistical analysis, we changed the filled symbols (and solid lines) to empty symbols (and
dashed lines) in Figs. 6b, 6d and 6f. Interestingly, for the MIPS network
$\langle s_{ij} \rangle(d)$ increases in this region, reaching a value at the maximal distance
$d_{\max}$ almost as high as at $d = 1$. However, this is due to the fact that the five nodes
making up the pairs at $d_{\max}$ happen to be more similar to a randomly chosen node than
average. (Nodes having a couple of non-specific tags can indeed be quite similar to the
majority of the nodes.) Since the number of pairs at large $d$ is small, the contribution
from these nodes is significant, and $\langle s_{ij} \rangle$ becomes larger than at medium $d$,
where the vast number of other nodes counterbalances this distortion.

Figure 6. The similarity as a function of the distance between the nodes. The
density distribution of $s_{ij}$ at various distances is plotted on a semi-logarithmic scale
for the MIPS network, the Wiki-Japan network and the MathSciNet in panels (a), (c)
and (e), respectively. The corresponding average similarity, $\langle s_{ij} \rangle$, as a function
of the node distance $d$ is shown in panels (b), (d) and (f). The number of pairs at large
$d$ becomes small; therefore, the results for $\langle s_{ij} \rangle$ in this regime cannot be trusted.
The empty symbols and dashed lines indicate that the number of pairs has decreased
below the total number of links in the network.

4.4. Node uniqueness

We now move on to the investigation of the node uniqueness, defined in (8). Our main
interest concerns the dependence of $u$ on the node degree. We divide the nodes into
three classes of equal size depending on their $u$ value: specific nodes have relatively high
$u$ (marked by either a rare label or a few closely related rare labels), medium nodes
have a $u$ value around the average, whereas diverse nodes have a relatively low $u$ value
(marked by frequent or unrelated labels). In Fig. 7 we show the participation ratio
of the nodes in the three classes as a function of the node degree $k$. Again, the three
systems show different behaviour. In the case of the Wikipedia and the MathSciNet, the
ratio of diverse nodes increases monotonically with the node degree. This tendency
is very pronounced in the latter network (Fig. 7c), where in fact all hubs above a certain
degree are classified as diverse. This is consistent with the steady increase in the average
number of tags as a function of the node degree in Fig. 3b (square symbols): for nodes
with $n \sim 10^2$ tags we expect to find at least a few pairs of rather unrelated categories,
resulting in a low $u$ value. In contrast, for the MIPS network, the monotonic increase
in the ratio of diverse nodes with the node degree is followed by a sudden drop at the
largest degrees. This means that a significant portion of the hubs in this network have
rather specific functions.

Figure 7. The participation ratio of the nodes in the three node uniqueness classes
as a function of the node degree $k$ for the MIPS network (a), the Wikipedia/Japan
network (b) and the MathSciNet (c).
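
The division into three classes and the resulting participation ratios can be sketched as follows. The uniqueness values are taken as given input, the participation ratio is assumed here to mean the fraction of nodes of degree $k$ that fall into each class, and the function names and the toy data are hypothetical.

```python
from collections import defaultdict

CLASSES = ("diverse", "medium", "specific")

def uniqueness_classes(u_values):
    """Split nodes into three equally sized classes by their uniqueness u.

    Ties are resolved by the sort order, which is an arbitrary choice.
    """
    ranked = sorted(u_values, key=u_values.get)  # lowest u first
    n = len(ranked)
    labels = {}
    for rank, node in enumerate(ranked):
        if rank < n // 3:
            labels[node] = "diverse"    # lowest third of u
        elif rank < 2 * n // 3:
            labels[node] = "medium"
        else:
            labels[node] = "specific"   # highest third of u
    return labels

def participation_ratio(degrees, labels):
    """For each degree k, the fraction of degree-k nodes in each class."""
    counts = defaultdict(lambda: defaultdict(int))
    for node, k in degrees.items():
        counts[k][labels[node]] += 1
    ratios = {}
    for k, per_class in counts.items():
        total = sum(per_class.values())
        ratios[k] = {c: per_class.get(c, 0) / total for c in CLASSES}
    return ratios

# Made-up uniqueness values and degrees, for illustration only
u = {"n1": 0.1, "n2": 0.2, "n3": 0.5, "n4": 0.6, "n5": 0.8, "n6": 0.9}
degrees = {"n1": 5, "n2": 5, "n3": 2, "n4": 5, "n5": 1, "n6": 1}
labels = uniqueness_classes(u)
ratios = participation_ratio(degrees, labels)
print(ratios[5])
```

For large networks the degrees would typically be binned before the ratios are computed, since high degrees are sparsely populated.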

5. Summary and conclusion 

We studied the basic statistical properties of tags in real networks, with an interest in the
relation between the topology and the tag distribution. We found that the investigated
systems show universal features in some respects, with interesting differences in others.
At small and intermediate degrees the average number of tags per node increases with
the degree, and accordingly the node uniqueness decreases. For the MathSciNet this
tendency continues in the high-degree regime as well. In contrast, the number of tags on
the hubs in the MIPS network drops and simultaneously the ratio of nodes with large
uniqueness starts to increase. The behaviour of the English Wikipedia is somewhere in
between: the number of tags saturates for the hubs and the further increase in the ratio
of nodes with low uniqueness is marginal. This comparison reflects the difference in the
behaviour of hubs in these networks: the hubs of the MathSciNet are very versatile, with
huge numbers of different tags and low values of uniqueness, whereas in the MIPS network
a significant portion of the hubs correspond to proteins with rather specific functions.

We introduced the tag-similarity of nodes, which (in contrast to the usual structural 
similarity) can be calculated independently of the graph topology, and is based on the set 
of tags associated with the nodes. According to our results, the studied real networks 
show tag-assortativity: the similarity is decreasing with the node distance at small 
range and reaches a minimum at medium distances. In other words, tag-similar nodes
are linked with each other with higher probability than expected at random. The tag-assortativity is
supported by the investigation of the tag-induced sub-graphs as well, since the number 
of links between the nodes sharing a given tag is always larger than (or at least equal 
to) the number of links expected at random. 

An even more interesting property of the tag-induced sub-graphs is that the average
number of their links follows a power law as a function of the number of their nodes over
several orders of magnitude. The magnitude of the tag-assortativity exponent $\mu$, defined
by this power law, is closely related to the tag-assortativity of the network: a
value of $\mu > 2$ indicates tag-disassortativity; $\mu = 2$ characterises no correlation between
tag-similarity and link distribution; whereas $0 < \mu < 2$ is the regime of tag-assortativity
(with $0 < \mu < 1$ representing extreme tag-assortativity). The tag-assortativity exponent
was slightly above 1 for all studied networks in our case.

The above scaling also reveals that the structure of the studied tagged networks is 
self-similar. This is supported by the fact that the statistical properties of tag-induced 
sub-graphs are similar to those of the entire graph. This means that in the statistical
sense, the network is related to a sub-graph induced by a given category $\alpha$ in the same
way as this sub-graph is related to the tag-induced sub-graph of a descendant of $\alpha$, i.e.
both the network topology and the tag distribution are scale invariant.

Appendix 

Preparing a Directed Acyclic Graph (DAG) from the category hierarchy of the English 
Wikipedia 

In Wikipedia the classification terms of each page (appearing at the bottom of the 
page) are called categories and are arranged into a hierarchy, i.e. a directed network 
where a more general term is connected to each of its child terms via a directed link. 
It is important to note that this directed graph contains cycles (loops): a closed path 
of nodes where each node (a category) is a sub-category of the previous one and the 
first is a sub-category of the last. Many of these loops are short and are made up of 
a small group of synonymous terms, e.g., the categories Hindustani and Urdu are very
closely related and each is a sub-category of the other. An example of a longer loop
is Education: Social sciences: Academic disciplines: Academia: Education, and a loop
of length 22 has also been found in the English Wikipedia [68].

Loops in the category hierarchy can confuse both readers and search engines, and
prohibit a tree-based semantic analysis of annotations. For example, with loops it would
be impossible to identify the closest common ancestor(s) of two arbitrary terms and decide
their level of relatedness. To delete all loops from the hierarchy of Wikipedia categories,
we first devised an algorithm that eliminates all loops from a generic directed network by
sequentially removing single directed links while modifying the directed network by the
smallest possible amount. Then, we applied the algorithm to the directed network
defined by the category hierarchy of the English Wikipedia.

The algorithm can be applied to an arbitrary directed network (nodes connected
with directed links) and it has two parts. First, it identifies the "loop sub-graph" of
the full directed graph, the set containing precisely the directed links of all loops. This
is achieved by an iteration where in each step all directed links are removed that have
either a start node that is a source (no incoming links) or an end node that is a drain (no
outgoing links). Neither of these two node types (source and drain) can be in a loop.
Repeating this removal step until no further link can be removed leads to a sub-graph
containing precisely the loops of the full graph. Note that the loop sub-graph may have
more than one graph component.
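
The first part of the algorithm can be sketched in a few lines. This is only an illustration of the source/drain removal step described above, not the program used for the analysis; the function name `loop_subgraph` and the representation of links as `(start, end)` tuples are assumptions.

```python
def loop_subgraph(links):
    """Iteratively drop links that start at a source or end at a drain.

    links: iterable of (start, end) tuples.
    Returns the set of links that survive the pruning.
    """
    remaining = set(links)
    while True:
        has_in = {v for _, v in remaining}    # nodes with an incoming link
        has_out = {u for u, _ in remaining}   # nodes with an outgoing link
        # A link survives a step only if its start node has an incoming link
        # and its end node has an outgoing link.
        kept = {(u, v) for u, v in remaining if u in has_in and v in has_out}
        if kept == remaining:   # nothing was removed in this step: stop
            return kept
        remaining = kept

# Four-link toy network: only the two links of the loop between A and B survive
links = [("M", "A"), ("A", "B"), ("B", "A"), ("B", "N")]
print(sorted(loop_subgraph(links)))  # [('A', 'B'), ('B', 'A')]
```

For a network without loops the same function returns an empty set, since every step exposes new sources and drains until no link is left.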

The second step of the algorithm identifies a set of directed links ($L$) whose removal
from the loop sub-graph eliminates all of its loops. As the loop sub-graph is by definition
the set of loops of the original graph, removing the same directed links from the full
graph will eliminate its loops. We selected the set of removed links, $L$, with the goal of
modifying the full graph by the smallest possible amount. This concerns not only the size of
$L$ (the number of links removed), but also selecting links with the smallest significance as
viewed from the full graph. Turning back to one of the above examples, one has to decide
which of the two directed links "Urdu is a sub-category of Hindustani" or "Hindustani
is a sub-category of Urdu" is less relevant from the point of view of the entire directed
network. More generally, suppose that in a (directed) network the directed links $A \to B$
and $B \to A$ are both present. To eliminate the loop $A \to B \to A$, one of the two links has
to be removed.

To decide which of the two links is less significant, consider another example. In a
directed network with the four links $M \to A$, $A \to B$, $B \to A$ and $B \to N$, the link
$A \to B$ is more important than $B \to A$, because it is contained in a long continuous path,
$M \to A \to B \to N$. On the other hand, $B \to A$ points in the opposite direction, thus, it is
likely to be a "side effect". The difference between these two links can be measured.
The number of point-to-point shortest directed paths passing through $A \to B$ is larger (4:
$M \to B$, $M \to N$, $A \to B$ and $A \to N$) than the number of those containing $B \to A$
(only 1: $B \to A$).
In a directed network the number of shortest paths passing through a given (directed) 
link is called the directed betweenness centrality of that link. Multiple shortest paths 
between two nodes are accounted for by weighting, see e.g., Ref. [2] for the undirected 
case. Based on the above observation, we quantified the significance of each directed 
link by its directed betweenness centrality, B, as measured in the full network. 
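
The directed betweenness of the links can be computed, for example, with a breadth-first accumulation in the style of Brandes' algorithm. The sketch below is one possible implementation and not necessarily the procedure used here; it assumes an unweighted network without duplicate links, and the function name `link_betweenness` is hypothetical. Multiple shortest paths between two nodes are handled by giving each link the fraction of the shortest paths that pass through it.

```python
from collections import deque, defaultdict

def link_betweenness(links):
    """Directed link betweenness for a list of (start, end) tuples.

    For every ordered node pair (s, t), each link receives the fraction of
    the shortest directed s->t paths that pass through it.
    """
    succ = defaultdict(list)
    nodes = set()
    for u, v in links:
        succ[u].append(v)
        nodes.update((u, v))
    score = {link: 0.0 for link in links}
    for s in nodes:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist = {s: 0}
        sigma = defaultdict(float)
        sigma[s] = 1.0
        preds = defaultdict(list)
        order = []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in succ[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:   # u precedes v on a shortest path
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        # Accumulate path fractions from the farthest nodes back towards s.
        delta = defaultdict(float)
        for v in reversed(order):
            for u in preds[v]:
                share = sigma[u] / sigma[v] * (1.0 + delta[v])
                score[(u, v)] += share
                delta[u] += share
    return score

# The four-link example of the text
score = link_betweenness([("M", "A"), ("A", "B"), ("B", "A"), ("B", "N")])
print(score[("A", "B")] > score[("B", "A")])  # True
```

The cost of this sketch is one breadth-first search per node, which is affordable for networks of the size considered in this Appendix.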

Now let us return to the second part of the algorithm, starting from the loop sub-graph.
Knowing $B$ of each link in this sub-graph, we can select and remove the least
important link, i.e. the one with the lowest $B$ value. This link removal may produce
source nodes (only outgoing links) and drain nodes (only incoming links). Again we
iteratively remove links not contained in loops until the remaining network "melts
down" to the set of remaining loops. We repeat this step (deleting the link with the
smallest $B$ and then iteratively removing all non-loop links) until no more links remain.
We save the set of removed links, $L$, and remove the same set of links from the full graph
to eliminate all of its loops while modifying it by the smallest possible amount.
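
The second part can be sketched as the loop below, under the assumption that the betweenness of every link has already been measured in the full network and is passed in as a dictionary. The function names, the tie-breaking rule and the betweenness values of the toy example are hypothetical; the values are made up, and only their ordering matters.

```python
def prune_non_loop_links(links):
    """Repeatedly drop links that start at a source or end at a drain."""
    remaining = set(links)
    while True:
        has_in = {v for _, v in remaining}
        has_out = {u for u, _ in remaining}
        kept = {(u, v) for u, v in remaining if u in has_in and v in has_out}
        if kept == remaining:
            return kept
        remaining = kept

def links_to_remove(links, betweenness):
    """Return the list L of links whose removal leaves the network loop-free."""
    removed = []
    loops = prune_non_loop_links(links)
    while loops:
        # Least significant link: lowest betweenness measured in the full network.
        # Ties are broken by the link name, which is an arbitrary choice.
        weakest = min(loops, key=lambda link: (betweenness[link], link))
        removed.append(weakest)
        loops = prune_non_loop_links(loops - {weakest})
    return removed

# Toy network with two overlapping loops: X->Y->Z->X and X->Y->X
links = [("X", "Y"), ("Y", "Z"), ("Z", "X"), ("Y", "X")]
betweenness = {("X", "Y"): 5.0, ("Y", "Z"): 3.0, ("Z", "X"): 2.0, ("Y", "X"): 1.0}
L = links_to_remove(links, betweenness)
dag = set(links) - set(L)
print(L)
```

After the links in `L` are deleted, applying the pruning step to the remaining network leaves no link at all, which confirms that no loop is left.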

The full category hierarchy of the English Wikipedia (Oct/17/2007 version)
contains 265 432 nodes (categories) and 543 722 directed links (category to sub-category
connections). The loop sub-graph has 4 980 nodes and 13 164 (directed) links. The total
number of removed links was $|L| = 3\,977$. The data and the processing programs can
be downloaded from the website http://CFinder.org